An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies

Author

  • Sridhar Mahadevan
Abstract

Department of Computer Science and Engineering, University of South Florida, Tampa, Florida 33620
[email protected]

Average-reward reinforcement learning (ARL) is an undiscounted optimality framework that is generally applicable to a broad range of control tasks. ARL computes gain-optimal control policies that maximize the expected payoff per step. However, gain-optimality has some intrinsic limitations as an optimality criterion: for example, it cannot distinguish between different policies that all reach an absorbing goal state but incur varying costs. A more selective criterion is bias optimality, which can filter gain-optimal policies to select those that reach absorbing goals with the minimum cost. While several ARL algorithms for computing gain-optimal policies have been proposed, none of them can guarantee bias optimality, since this requires solving at least two nested optimality equations. In this paper, we describe a novel model-based ARL algorithm for computing bias-optimal policies. We test the proposed algorithm on an admission-control queuing system, and show that by learning bias-optimal policies it utilizes the queue much more efficiently than a gain-optimal method.

Motivation

Recently, there has been growing interest in an undiscounted optimality framework called average-reward reinforcement learning (ARL) (Boutilier & Puterman 1995; Mahadevan 1994; 1996a; Schwartz 1993; Singh 1994; Tadepalli & Ok 1994). ARL is well-suited to many cyclical control tasks, such as a robot avoiding obstacles (Mahadevan 1996a), an automated guided vehicle (AGV) transporting parts (Tadepalli & Ok 1994), and process-oriented planning tasks (Boutilier & Puterman 1995), since the average reward is a good metric for evaluating performance in such tasks. However, one problem with the average-reward criterion is that it is not sufficiently selective, both in goal-based tasks and in tasks with no absorbing goals. Figure 1 illustrates this limitation on a simple two-dimensional grid-world task. Here, the learner is continually rewarded +10 for reaching and staying in the absorbing goal state G, and is rewarded -1 in all non-goal states. Clearly, all control policies that reach the goal have the same average reward. Thus, the average-reward criterion cannot be used to select policies that reach absorbing goals in the shortest time.
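As background for the "two nested optimality equations" mentioned above, the standard unichain formulation from the average-reward MDP literature (e.g., Puterman's treatment) is sketched below; the notation rho*, h, w is ours and not necessarily the paper's:

    \rho^* + h(s) = \max_{a \in A(s)} \Big[ r(s,a) + \sum_{s'} p(s' \mid s,a)\, h(s') \Big]

    h(s) + w(s) = \max_{a \in A^*(s)} \sum_{s'} p(s' \mid s,a)\, w(s')

where A^*(s) denotes the set of actions attaining the maximum in the first equation. The first equation determines the optimal gain rho* together with a bias function h; the second, maximized only over gain-optimal actions, selects among gain-optimal policies those with maximal bias.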
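A small numeric sketch (our illustration, not code from the paper; the step counts and helper name are assumptions) shows how two goal-reaching policies in the Figure 1 grid world share the same gain but differ in bias:

    def bias_and_gain(path_rewards, goal_reward=10.0):
        # In a unichain MDP where every policy eventually reaches an absorbing
        # goal paying goal_reward forever, the gain (average reward) of any
        # goal-reaching policy is goal_reward. Taking the goal state's bias as
        # the zero reference, the bias at the start state is the total
        # gain-adjusted reward accumulated along the way: sum_t (r_t - gain).
        gain = goal_reward
        bias = sum(r - gain for r in path_rewards)
        return gain, bias

    # Two gain-optimal policies (hypothetical step counts): each non-goal
    # step is rewarded -1, as in the grid world of Figure 1.
    short_path = [-1.0] * 3   # reaches G in 3 steps
    long_path = [-1.0] * 7    # reaches G in 7 steps

    for name, path in (("short", short_path), ("long", long_path)):
        gain, bias = bias_and_gain(path)
        print(f"{name} path: gain = {gain}, bias at start = {bias}")

Both policies print gain 10.0, but their biases are -33.0 versus -77.0, so only the shorter route is bias-optimal.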


Similar resources

An Average-Reward Reinforcement Learning

Recently, there has been growing interest in average-reward reinforcement learning (ARL), an undiscounted optimality framework that is applicable to many different control tasks. ARL seeks to compute gain-optimal control policies that maximize the expected payoff per step. However, gain-optimality has some intrinsic limitations as an optimality criterion, since for example, it cannot distinguish ...


An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies

Average-reward reinforcement learning (ARL) is an undiscounted optimality framework that is generally applicable to a broad range of control tasks. ARL computes gain-optimal control policies that maximize the expected payoff per step. However, gain-optimality has some intrinsic limitations as an optimality criterion, since for example, it cannot distinguish between different policies that all re...


Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results

This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal co...


Hierarchically Optimal Average Reward Reinforcement Learning

Two notions of optimality have been explored in previous work on hierarchical reinforcement learning (HRL): hierarchical optimality, or the optimal policy in the space defined by a task hierarchy, and a weaker local model called recursive optimality. In this paper, we introduce two new average-reward HRL algorithms for finding hierarchically optimal policies. We compare them to our previously r...



Publication date: 1996